Capitalization and punctuation restoration: a survey
Abstract
Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural language processing algorithms. This is especially significant for textual sources where punctuation and casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging platforms offer unreliable and often wrong punctuation and casing. This survey offers an overview of both historical and state-of-the-art techniques for restoring punctuation and correcting word casing. Furthermore, current challenges and research directions are highlighted.
Similar resources
Recovering Capitalization and Punctuation Marks on Speech Transcriptions
This work addresses two metadata annotation tasks involved in the production of rich transcripts: automatic capitalization and punctuation mark recovery. The main focus is broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared, and results support the idea that generative approaches capture the structure of writte...
LSTM for punctuation restoration in speech transcripts
The output of automatic speech recognition systems is generally an unpunctuated stream of words which is hard to process for both humans and machines. We present a two-stage recurrent neural network based model using long short-term memory units to restore punctuation in speech transcripts. In the first stage, textual features are learned on a large text corpus. The second stage combines textua...
Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages
This paper shows experimental results concerning automatic enrichment of the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach. The approach is language independent as reinforced by experiments performed on Portuguese and Spanish Broadcast News corpora. The discrimi...
Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news
The following material presents a study about recovering punctuation marks, and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using: finite state transducers automatically built from language models; and maximum entropy models. Several resources were used, includi...
Restoring Punctuation and Capitalization in Transcribed Speech
(54) GENERATING PROSODIC CONTOURS FOR SYNTHESIZED SPEECH. (75) Inventors: Martin Jansche, New York, NY (US); Michael D. Riley, New York, NY (US); Andrew M. Rosenberg, Brooklyn, NY. References cited: 6,871,178 B2 3/2005 Case et al.; 6,975,987 B1 12/2005 Tenpaku et al.; 6,990,449 B2 1/2006 Case; 6,990,450 B2 1/2006 Case et al.; 7,035,791 B2 4/2006 Chazan et al.; 7,062,439 B2 6/2006 Brittan et al.; 7,076,426 B1 7/2006 Beutn...
Journal
Journal title: Artificial Intelligence Review
Year: 2021
ISSN: 0269-2821, 1573-7462
DOI: https://doi.org/10.1007/s10462-021-10051-x